Predictive Overlapping Co-Clustering

نویسندگان

  • Chandrima Sarkar
  • Jaideep Srivastava
چکیده

In the past few years co-clustering has emerged as an important data mining tool for two way data analysis. Coclustering is more advantageous over traditional one dimensional clustering in many ways such as, ability to find highly correlated sub-groups of rows and columns. However, one of the overlooked benefits of co-clustering is that, it can be used to extract meaningful knowledge for various other knowledge extraction purposes. For example, building predictive models with high dimensional data and heterogeneous population is a non-trivial task. Co-clusters extracted from such data, which shows similar pattern in both the dimension, can be used for a more accurate predictive model building. Several applications such as finding patient-disease cohorts in health care analysis, finding user-genre groups in recommendation systems and community detection problems can benefit from co-clustering technique that utilizes the predictive power of the data to generate co-clusters for improved data analysis. In this paper, we present the novel idea of Predictive Overlapping Co-Clustering (POCC) as an optimization problem for a more effective and improved predictive analysis. Our algorithm generates optimal co-clusters by maximizing predictive power of the co-clusters subject to the constraints on the number of row and column clusters. In this paper precision, recall and f-measure have been used as evaluation measures of the resulting co-clusters. Results of our algorithm has been compared with two other well-known techniques K-means and Spectral co-clustering, over four real data set namely, Leukemia, Internet-Ads, Ovarian cancer and MovieLens data set. The results demonstrate the effectiveness and utility of our algorithm POCC in practice.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Generalized Framework for Mining Arbitrarily Positioned Overlapping Co-clusters

The goal of co-clustering is to simultaneously cluster both rows and columns in a given matrix. Motivated by several applications in text mining, recommendation systems and bioinformatics, different methods have been developed to discover local patterns that cannot be identified by traditional clustering algorithms. In spite of much research in this domain, existing co-clustering algorithms hav...

متن کامل

Model-based Overlapping Co-Clustering

Co-clustering or simultaneous clustering of rows and columns of two-dimensional data matrices, is a data mining technique with various applications such as text clustering and microarray analysis. Most proposed co-clustering algorithms work on the data matrices with special assumptions and they also assume the existence of a number of mutually exclusive row and column clusters, but it is believ...

متن کامل

یک مدل موضوعی احتمالاتی مبتنی بر روابط محلّی واژگان در پنجره‌های هم‌پوشان

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, given the documents and extract topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...

متن کامل

Detecting Overlapping Communities in Social Networks using Deep Learning

In network analysis, a community is typically considered of as a group of nodes with a great density of edges among themselves and a low density of edges relative to other network parts. Detecting a community structure is important in any network analysis task, especially for revealing patterns between specified nodes. There is a variety of approaches presented in the literature for overlapping...

متن کامل

Robust Overlapping Co-clustering

Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. On such datasets, in order to accurately identify meaningful clusters, both non-informative data points and non-discriminative features need to be discarded. Additio...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014